
rptest/datalake: test metadata interoperability with 3rd party system #24643

Open
wants to merge 7 commits into dev from nv/iceberg-3rdparty-rewrite
Conversation

@nvartolomei (Contributor) commented Dec 23, 2024

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

  • none

@nvartolomei force-pushed the nv/iceberg-3rdparty-rewrite branch 2 times, most recently from 6f1f148 to 6e607dd on December 23, 2024 20:21
@vbotbuildovich (Collaborator)

Retry command for Build#60089

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/enterprise_features_license_test.py::EnterpriseFeaturesTest.test_enable_features@{"disable_trial":false,"feature":5,"install_license":true}

@vbotbuildovich (Collaborator) commented Dec 23, 2024

CI test results

test results on build#60089
  • rptest.tests.datalake.partition_movement_test.PartitionMovementTest.test_cross_core_movements.cloud_storage_type=CloudStorageType.S3 (ducktape): FLAKY, passed 2/6. https://buildkite.com/redpanda/redpanda/builds/60089#0193f576-ce17-455b-91e6-2f265aaa9185
  • rptest.tests.enterprise_features_license_test.EnterpriseFeaturesTest.test_enable_features.feature=Feature.oidc.install_license=True.disable_trial=False (ducktape): FAIL, passed 0/1. https://buildkite.com/redpanda/redpanda/builds/60089#0193f576-ce16-4642-b44b-4eb53b3c9ee5

test results on build#60100
  • rptest.tests.datalake.partition_movement_test.PartitionMovementTest.test_cross_core_movements.cloud_storage_type=CloudStorageType.S3 (ducktape): FLAKY, passed 1/6. https://buildkite.com/redpanda/redpanda/builds/60100#0193f642-b1a2-4bf2-af5d-e55ed5e9b6f4

test results on build#60178
  • rptest.tests.datalake.partition_movement_test.PartitionMovementTest.test_cross_core_movements.cloud_storage_type=CloudStorageType.S3 (ducktape): FLAKY, passed 2/6. https://buildkite.com/redpanda/redpanda/builds/60178#01941785-f08b-45dd-85b4-2851b2d2bb86
  • rptest.tests.e2e_topic_recovery_test.EndToEndTopicRecovery.test_restore_with_aborted_tx.recovery_overrides=.retention.local.target.bytes.1024.redpanda.remote.write.True.redpanda.remote.read.True.cloud_storage_type=CloudStorageType.ABS (ducktape): FAIL, passed 0/1. https://buildkite.com/redpanda/redpanda/builds/60178#01941785-f089-4e4a-9d39-60eb4a8babe0

@nvartolomei force-pushed the nv/iceberg-3rdparty-rewrite branch 2 times, most recently from 1292bb1 to 6e1e926 on December 24, 2024 00:06
Comment on lines +155 to +169
    def run_sample_maintenance_task(self, namespace, table) -> None:
        # Metadata query
        # https://iceberg.apache.org/docs/1.6.1/spark-queries/#files
        initial_parquet_files = self.run_query_fetch_one(
            f"SELECT count(*) FROM {namespace}.{table}.files")[0]

        # Want at least 2 files to be able to assert that optimization did something.
        assert initial_parquet_files >= 2, f"Expecting at least 2 files, got {initial_parquet_files}"

        # Spark Procedures provided by Iceberg SQL Extensions
        # https://iceberg.apache.org/docs/1.6.1/spark-procedures/#rewrite_data_files
        self.run_query_fetch_one(
            f"CALL `redpanda-iceberg-catalog`.system.rewrite_data_files(\"{namespace}.{table}\")"
        )

        optimized_parquet_files = self.run_query_fetch_one(
            f"SELECT count(*) FROM {namespace}.{table}.files")[0]
        assert optimized_parquet_files < initial_parquet_files, f"Expecting fewer files after optimize, got {optimized_parquet_files}"
andrwng (Contributor):

IMO these service classes should only hold operations, not validations. To that end, can we split this into, say, count_parquet_files() and optimize_parquet_files()? There are many different kinds of maintenance (https://iceberg.apache.org/docs/1.5.1/maintenance/), and it'll be much easier to reuse if we have smaller, tighter, more descriptively named primitives.
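
For illustration, the suggested split might look roughly like this in the Spark service (a sketch reusing run_query_fetch_one and the catalog name from the snippet above; the bodies here are hypothetical, not the PR's actual code):

def count_parquet_files(self, namespace, table) -> int:
    # The Iceberg "files" metadata table lists the data files backing the table.
    # https://iceberg.apache.org/docs/1.6.1/spark-queries/#files
    return self.run_query_fetch_one(
        f"SELECT count(*) FROM {namespace}.{table}.files")[0]

def optimize_parquet_files(self, namespace, table) -> None:
    # Compact small data files via the Iceberg rewrite_data_files procedure.
    # https://iceberg.apache.org/docs/1.6.1/spark-procedures/#rewrite_data_files
    self.run_query_fetch_one(
        f"CALL `redpanda-iceberg-catalog`.system.rewrite_data_files(\"{namespace}.{table}\")")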

nvartolomei (Contributor, Author):

@andrwng I see what you mean, but here I wanted to give more freedom to these maintenance tasks' implementations, i.e. it might as well be a metadata-only maintenance task. I documented the expectations in the code doc:

def run_sample_maintenance_task(self, namespace, table) -> None:

optimize_parquet_files and count_parquet_files would work for the 2 engines we have now; do you reckon all engines expose the same primitives, so that we could have a single test for all of them?

Willem (Contributor):

I agree that run_sample_maintenance_task() seems a bit vague, whereas the Iceberg docs Andrew provided have a concrete list of tasks we could enumerate, these being:

Maybe these are the primitives we should be working with and leave the assertions to the user.

nvartolomei (Contributor, Author):

Maybe these are the primitives we should be working with and leave the assertions to the user.

Assertions like ...? I don't understand what the exact proposal is.

andrwng (Contributor):

I believe Willem's referring to assert statements in run_sample_maintenance_task() -- to my earlier point, I feel strongly that they don't belong in this Spark service library. They belong in test bodies, or other test verifiers.

here I wanted to give more freedom to these maintenance tasks implementation...do you reckon all of them expose the same primitives so that we could have a single test for all of them?

I can see how it's a nice property to have, that most query engines can implement some kind of maintenance, and so would be able to slot into the new test. To your point, I don't think every engine will always support the exact set of maintenance operations I posted/suggested. IMO once we come across such an engine, we should decide what to do at that time, rather than preemptively introducing a loosely defined primitive -- it's hard for me to imagine this maintenance function will be used widely by test authors.

If you insist that "arbitrary maintenance" belongs here, then please also at least split out optimize_parquet_files and count_parquet_files (or your preferred names) into the Spark and Trino services and have run_sample_maintenance_task() call them.
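
As a sketch of that layering (hypothetical; it assumes the count_parquet_files/optimize_parquet_files primitives from above and a spark service object in the test body):

def run_sample_maintenance_task(self, namespace, table) -> None:
    # Thin wrapper over a concrete primitive; no assertions in the service class.
    self.optimize_parquet_files(namespace, table)

# In the test body, where the assertions would live:
before = spark.count_parquet_files(namespace, table)
assert before >= 2, f"Expecting at least 2 files, got {before}"
spark.run_sample_maintenance_task(namespace, table)
after = spark.count_parquet_files(namespace, table)
assert after < before, f"Expecting fewer files after optimize, got {after}"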

tests/rptest/tests/datalake/query_engine_base.py (outdated review thread, resolved)
tests/rptest/tests/datalake/3rdparty_maintenance_test.py (outdated review thread, resolved)
@nvartolomei force-pushed the nv/iceberg-3rdparty-rewrite branch 2 times, most recently from 4b21a17 to 92f74aa on December 30, 2024 10:39
@vbotbuildovich (Collaborator)

Retry command for Build#60178

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/e2e_topic_recovery_test.py::EndToEndTopicRecovery.test_restore_with_aborted_tx@{"cloud_storage_type":2,"recovery_overrides":{"redpanda.remote.read":true,"redpanda.remote.write":true,"retention.local.target.bytes":1024}}

@dotnwat (Member) commented Jan 6, 2025

got a merge conflict

        cursor.execute(query)
        yield cursor
    finally:
        cursor.close()
Member:

was this redundant? it seems like it was handling the cursor closing, and the outer try block was handling the client closing.
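
For context, the nesting in question looks roughly like this (a hypothetical sketch, not the PR's exact code): the inner finally owns the cursor, the outer one owns the client connection.

from contextlib import contextmanager

@contextmanager
def run_query(client, query):
    try:
        cursor = client.cursor()
        try:
            cursor.execute(query)
            yield cursor
        finally:
            # Inner block: close the cursor even if the caller raises.
            cursor.close()
    finally:
        # Outer block: close the client connection.
        client.close()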

@@ -60,3 +60,16 @@ def max_translated_offset(self, namespace, table, partition) -> int:
        query = f"select max(redpanda.offset) from {namespace}.{self.escape_identifier(table)} where redpanda.partition={partition}"
        with self.run_query(query) as cursor:
            return cursor.fetchone()[0]

    def run_sample_maintenance_task(self, namespace, table) -> None:
Member:

i'm not sure what _sample_ in the name refers to
